(54) Processor and method for reducing its power usage

(57) A method of optimizing assembly code of a VLIW processor (10) or other processor that uses multiple-instruction words (20), each of which comprises instructions to be executed on different functional units (11d and 11e) of the processor (10). The instruction words (20) are modified in accordance with one or more code optimization techniques (FIGURE 6). Typically, the modifications tend to result in fewer cycle-to-cycle bit changes in the machine code, which results in reduced power consumption.
Description

TECHNICAL FIELD OF THE INVENTION

[0001] This invention relates to processors, and more particularly, but not exclusively, to methods of using programming instructions in a manner that reduces the power consumption of a processor.

BACKGROUND OF THE INVENTION 

[0002] Power efficiency for processor-based equipment is becoming increasingly important as people are becoming more attuned to energy conservation issues. Specific considerations are the reduction of thermal effects and operating costs. Also, apart from energy conservation, power efficiency is a concern for battery-operated processor-based equipment, where it is desired to minimize battery size so that the equipment can be made small and lightweight. The "processor-based equipment" can be either equipment designed especially for general computing or equipment having an embedded processor.

[0003] From the standpoint of processor design, a number of techniques have been used to reduce power usage. These techniques can be grouped as two basic strategies. First, the processor's circuitry can be designed to use less power. Second, the processor can be designed in a manner that permits power usage to be managed.

[0004] On the other hand, given a particular processor design, its programming can be optimized for reduced power consumption. Thus, from a programmer's standpoint, there is often more than one way to program a processor to perform the same function. For example, algorithms written in high level programming languages can be optimized for efficiency in terms of time and power. Until recently, at the assembly language level, most optimization techniques have been primarily focused on speed of execution without particular regard to power use.

[0005] The programmer's task of providing power efficient code can be performed manually or with the aid of an automated code analysis tool. Such a tool might analyze a given program so as to provide the programmer with information about its power usage. Other such tools might actually assist the programmer in generating optimized code.

[0006] U.S. Patent No. 5,557,557, to Franz, et al., entitled "Processor Power Profiler", assigned to Texas Instruments Incorporated, describes a method of modeling power usage during program execution. A power profiler program analyzes the program and provides the programmer with information about energy consumption. A power profiler is also described in U.S. Patent Serial No. 60/046,811, to L. Hurd, entitled "Module-Configurable, Full-Chip Power Profiler", assigned to Texas Instruments Incorporated.

[0007] Once the power requirements of a particular program are understood, the code can be optimized. Automating this aspect of programming requires a code generation tool that can restructure computer code, internal algorithms as well as supporting functions, for minimum power usage.

SUMMARY OF THE INVENTION 

[0008] One aspect of an embodiment of the invention is a method of optimizing computer programs for power usage. It is based on the recognition that power consumption is reduced when there is a minimum of change in the machine-level representation of the program from each CPU cycle to the next. The method is useful for various types of processors that execute "multiple-instruction words" (as defined herein) by different functional units of the processor. Examples of such processors are VLIW (very long instruction word) processors and dual datapath processors.

[0009] The method comprises a set of steps, any one of which may be performed independently. Each step involves scanning the code and comparing a given field or other code sequence within instructions. Generally, it is the code syntax that is of interest, as opposed to its functionality. It is determined if there are code sequences where cycle-to-cycle bit changes in the machine code representation of that code sequence can be minimized. Then, the code is modified if this can be done without adversely affecting code functionality.

[0010] For example, one aspect of an embodiment of the invention is a method where the code sequences of interest are functional unit assignments. Typically, each instruction of the instruction word occupies a "slot" of the word. For each slot, the field that identifies the functional unit is scanned. Cycle-to-cycle bit changes in this field are reduced by re-arranging instructions within instruction words. Because instructions are merely rearranged, code functionality is not affected.

[0011] An advantage of an embodiment of the invention is that it is directed to optimization at the processor architecture level, rather than to high level programming. This permits a processor to be programmed in a manner that is most efficient for that processor. The method can be easily adapted to the characteristics of the processor and its instruction set.
BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIGURE 1 is a block diagram of a VLIW DSP processor.

[0013] FIGURE 2 illustrates the basic format of a fetch packet used by the processor of FIGURE 1.

[0014] FIGURE 3 illustrates an example of the fetch packet of FIGURE 2.

[0015] FIGURE 4A illustrates the mapping of the instruction types for the processor of FIGURE 1 to the functional units in its datapaths.

[0016] FIGURE 4B is a table describing the mnemonics of FIGURE 4A.

[0017] FIGURE 5 illustrates a fetch packet having multiple execute packets.

[0018] FIGURE 6 illustrates a code optimization process in accordance with an embodiment of the invention.

[0019] FIGURES 7A and 7B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 61 of FIGURE 6.

[0020] FIGURES 8A and 8B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 63 of FIGURE 6.

[0021] FIGURES 9A and 9B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 63 of FIGURE 6.

[0022] FIGURES 10A and 10B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 64 of FIGURE 6.

[0023] FIGURES 11A and 11B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 65 of FIGURE 6.

[0024] FIGURES 12A and 12B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 65 of FIGURE 6.

[0025] FIGURES 13A and 13B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 67 of FIGURE 6.

[0026] FIGURES 14A and 14B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 68 of FIGURE 6.

[0027] FIGURES 15A and 15B illustrate an example of unoptimized code together with the corresponding optimized code, respectively, where the optimization has been performed in accordance with Step 68 of FIGURE 6.

DETAILED DESCRIPTION OF THE INVENTION 

[0028] Embodiments of the invention described herein are directed to power management for microprocessors. An underlying principle of operation is that the programming provided to the processor can be optimized so as to reduce power usage. Given a particular instruction set, a program using these instructions can be analyzed to detect the presence of non-optimal instruction sequences. These sequences can be modified so that power usage is more efficient, without adversely affecting code functionality.

[0029] The method of embodiments of the invention is most useful with VLIW (very long instruction word) processors, which are characterized by their ability to execute multiple instructions in parallel using different functional units within the processor. Embodiments of the invention are also useful with "dual datapath" processors, which execute two instructions in parallel on two datapaths. Both types of processors execute "multiple-instruction words" in parallel in more than one functional unit. However, parallelism is not a limitation of embodiments of the invention, and any processor that fetches and decodes more than one instruction at a time will benefit from the optimization process. As explained below, for such processors, cycle-to-cycle instruction fetching, decoding, and dispatching can be optimized for power if the code is arranged properly.

[0030] In light of the preceding paragraph, the term "processor" as used herein may include various types of microcontrollers and digital signal processors (DSPs). To this end, the following description is in terms of DSPs - the TMS320 family of DSPs and the TMS320C6x DSP in particular. However, this selection of a particular processor is for purposes of description and example only.



Processor Overview



[0031] FIGURE 1 is a block diagram of a DSP processor 10. As explained below, processor 10 has a VLIW architecture, and fetches multiple-instruction words (as "fetch packets") to be executed in parallel (as "execute packets") during a single CPU clock cycle. In the example of this description, processor 10 operates at a 5 nanosecond CPU cycle time and executes up to eight instructions every cycle.

[0032] Processor 10 has a CPU core 11, which has a program fetch unit 11a, and instruction dispatch and decode units 11b and 11c, respectively. To execute the decoded instructions, processor 10 has two datapaths 11d and 11e.

[0033] Instruction decode unit 11c delivers execute packets having up to eight instructions to the datapath units 11d and 11e every clock cycle. Datapaths 11d and 11e each include 16 general-purpose registers. Datapaths 11d and 11e each also include four functional units (L, S, M, and D), which are connected to the general-purpose registers. Thus, processor 10 has eight functional units, each of which may execute one of the instructions in an execute packet. Each functional unit has a set of instruction types that it is capable of executing.

[0034] The control registers 11f provide the means to configure and control various processor operations. The control logic unit 11g has logic for control, test, emulation, and interrupt functions.

[0035] Processor 10 also comprises program memory 12, data memory 13, and timer 14. Its peripheral circuitry includes a direct memory access (DMA) controller 15, external memory interface 16, host port 17, and power down logic 18. The power down logic 18 can halt CPU activity, peripheral activity, and timer activity to reduce power consumption. These power down modes, as well as features of processor 10 other than the features of embodiments of the present invention, are described in U.S. Patent Serial No. 60/046,811, referenced in the Background and incorporated herein by reference.

[0036] Processor 10 executes RISC-like code, and has an assembly language instruction set. In other words, each of its VLIWs comprises RISC-type instructions. A program written with these instructions is converted to machine code by an assembler. Processor 10 does not use microcode or an internal microcode interpreter, as do some other processors. However, the invention described herein could be applicable regardless of whether RISC-like instructions control the processor or whether instructions are internally interpreted to a lower level.

[0037] In the example of this description, eight 32-bit instructions are combined to make the VLIW. Thus, in operation, 32-bit instructions are fetched eight at a time from program memory 12, to make a 256-bit instruction word. The "fetch packet" is comprised of these eight instructions fetched from memory 12.

[0038] FIGURE 2 illustrates the basic format of the fetch packet 20 used by processor 10. Each of the eight instructions in fetch packet 20 is placed in a location referred to as a "slot" 21. Thus, fetch packet 20 has Slots 1, 2, ... 8.

[0039] Processor 10 differs from other VLIW processors in that the entire fetch packet is not necessarily executed in one CPU cycle. All or part of a fetch packet is executed as an "execute packet". In other words, a fetch packet can be fully parallel, fully serial, or partially serial. In the case of a fully or partially serial fetch packet, where the fetch packet's instructions require more than one cycle to execute, the next fetch can be postponed. This distinction between fetch packets and execute packets permits every fetch packet to contain eight instructions, without regard to whether they are all to be executed in parallel.

[0040] For processor 10, the execution grouping of a fetch packet 20 is specified by a "p-bit" 22 in each instruction. In operation, instruction dispatch unit 11b scans the p-bits, and the state of the p-bit of each instruction determines whether the next instruction will be executed in parallel with that instruction. If so, it places the two instructions in the same execute packet to be executed in the same cycle.
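The grouping rule described above can be sketched as follows. This is a minimal illustration, not the dispatch logic of processor 10: it assumes each instruction is modeled as a hypothetical (mnemonic, p_bit) pair in which a p-bit of 1 means that the next instruction executes in parallel with this one. The function name and data layout are invented for the example.

```python
def split_into_execute_packets(fetch_packet):
    """Group a fetch packet into execute packets using each instruction's p-bit.

    fetch_packet: list of (mnemonic, p_bit) pairs (hypothetical model).
    A p_bit of 1 means the NEXT instruction runs in parallel with this one.
    """
    packets, current = [], []
    for mnemonic, p_bit in fetch_packet:
        current.append(mnemonic)
        if p_bit == 0:
            # The parallel chain ends here, so close this execute packet.
            packets.append(current)
            current = []
    if current:
        # The last instruction had p_bit == 1; keep the trailing group.
        packets.append(current)
    return packets
```

Under this model, a fetch packet whose p-bits are all 1 is fully parallel (one execute packet), and one whose p-bits are all 0 is fully serial (one execute packet per instruction).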

[0041] FIGURE 3 illustrates an example of a fetch packet 20. Whereas FIGURE 2 illustrates the format for the fetch packet 20, FIGURE 3 illustrates an example of instructions that a fetch packet 20 might contain. A fetch packet 20 typically has five to eight instructions, and the fetch packet 20 of FIGURE 3 has seven. Each instruction has a number of fields, which ultimately are expressed in bit-level machine code.

[0042] The || characters signify that an instruction is to execute in parallel with the previous instruction, and is coded as p-bit 22. As indicated, fetch packet 20 is fully parallel, and may be executed as a single execute packet.

[0043] The square brackets [] signify a conditional instruction, surrounding the identifier of a condition register. Thus, the first instruction in FIGURE 3 is conditioned on register A2 being nonzero. A ! character signifies "not", so that a condition on A2 being zero would be expressed as [!A2]. The conditional register field comprises these identifiers.
[0044] The opfield contains an instruction type from the instruction set of processor 10. Following the instruction type is the designation of the functional unit that will execute the instruction. As stated above in connection with FIGURE 1, each of the two datapaths 11d and 11e has four functional units. These functional units are L (logical), S (shift), M (multiply), and D (data). The opfield thus has the syntax [instruction type].[functional unit identifier].

[0045] Some instruction types can be performed by only one functional unit and some can be performed by one of a number of them. For example, only the M unit can perform a multiply (MPY). On the other hand, an add (ADD) can be performed by the L, S, or D unit. The correspondence of functional units to instructions is referred to herein as their "mapping".

[0046] FIGURE 4A is a table illustrating, for processor 10, the mapping of instruction types to functional units. It is useful for an understanding of the examples set out below in connection with code optimization. FIGURE 4B illustrates the description of each mnemonic.

[0047] The mapping of functional units to instruction types determines which instructions can be executed in parallel, and therefore whether a fetch packet will become more than one execute packet. For example, if only the M unit can perform a multiply (MPY), an execute packet could have two MPY instructions, one to be executed by each of the two datapaths 11d and 11e. In contrast, the L, S, and D units are all capable of executing an add (ADD), thus an execute packet could contain as many as six ADD instructions.
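The arithmetic behind these limits can be written out as a short sketch. The mapping table below is deliberately partial: it contains only the two instruction types whose unit assignments are stated in the text, and the names `UNITS_BY_TYPE` and `max_copies_per_execute_packet` are hypothetical.

```python
# Partial, illustrative mapping: only the two cases stated in the text.
UNITS_BY_TYPE = {
    "MPY": {"M"},            # only the M unit can multiply
    "ADD": {"L", "S", "D"},  # an add can run on the L, S, or D unit
}
NUM_DATAPATHS = 2  # datapaths 11d and 11e, each with its own L, S, M, D units


def max_copies_per_execute_packet(instr_type):
    """Upper bound on how many instructions of one type fit in one execute packet."""
    return len(UNITS_BY_TYPE[instr_type]) * NUM_DATAPATHS
```

The bound follows because each instruction in an execute packet must use a different functional unit.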

[0048] Referring again to FIGURE 3, the instruction's operand field follows the opfield. Depending on the instruction type, the operand field may identify one or more source registers, one or more constants, and a destination register.

[0049] FIGURE 5 is an example of code having multiple execute packets per fetch packet 20. In this example, there are two fetch packets 20. The first fetch packet 20 is executed in three execute packets, EP1, EP2, and EP3. The second fetch packet 20 is executed in four execute packets, EP1, EP2, EP3, and EP4.
[0050] To generalize the above-described processor architecture, an executable instruction word, i.e., an execute packet, contains up to eight instructions to be executed in parallel during a CPU cycle. Each instruction in an execute packet uses a different one of the functional units (L, D, S or M) of datapaths 11d and 11e. The instruction mapping determines which instruction types can be duplicated within an execute packet.

[0051] The use of instruction words in this manner lends itself to unique techniques for power optimization. As explained below, within an instruction word, instructions can be arranged so that, for each slot, changes from cycle to cycle are minimized.

Power Optimization Process 

[0052] FIGURE 6 illustrates a code optimization process in accordance with an embodiment of the invention. Each step involves a different code optimization technique. Each step could be performed alone as an independent code optimization technique, or in combination with one or more of the other steps.

[0053] Each of these steps is explained below, together with one or more examples of code optimization in accordance with that step. The code examples are consistent with the architecture of processor 10 as described above in connection with FIGURES 1 - 5. Specifically, the examples are consistent with a processor 10 that uses fetch packets that may be divided into execute packets, and special considerations for this distinction between fetch packets and execute packets are noted.

[0054] However, embodiments of the invention are equally useful for processors whose fetch packets are the same as the execute packets, as well as for processors that do not use "packets" in the conventional sense. The common characteristic of the code to be optimized is that it has "multiple-instruction words". The term "multiple-instruction word" is used to signify a set of instructions, where the instructions within the set are grouped at some point within the processor for processing (which may include fetching, decoding, dispatching, executing, or some combination of these functions), and where the executing is by different functional units of the processor. The "multiple-instruction word" may be structured as a fetch packet, or as an execute packet, or it may have a structure different from a conventional packet structure.

[0055] In general, each optimization technique is ultimately directed to finding and minimizing cycle-to-cycle bit changes in the binary representation of the assembly code. This is achieved without substantially affecting the overall functionality in terms of the number and type of instructions. Because the functionality is substantially the same, the result is less node switching when instructions are fetched from program memory and when they are decoded and dispatched. This, in turn, reduces power consumption. Each step of the overall optimization process is directed to finding and minimizing a different category of bit changes. In a general sense, the code is scanned for various syntax features as opposed to functional features.
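The quantity that every step tries to reduce can be made concrete with a small sketch. It assumes, for illustration only, that each cycle's instruction word is available as a list of integers, one machine-code word per slot; the function names are hypothetical.

```python
def bit_changes(a, b):
    """Number of bit positions that differ between two machine-code words."""
    return bin(a ^ b).count("1")


def toggles_per_slot(words_by_cycle):
    """Total cycle-to-cycle bit changes observed in each slot.

    words_by_cycle: list of cycles; each cycle is a list of ints, one per slot
    (an assumed representation of the multiple-instruction words).
    """
    totals = [0] * len(words_by_cycle[0])
    for prev, cur in zip(words_by_cycle, words_by_cycle[1:]):
        for slot, (p, c) in enumerate(zip(prev, cur)):
            totals[slot] += bit_changes(p, c)
    return totals
```

A tool could compute this metric before and after applying a step to confirm that the modification actually lowered the number of changing bits.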

[0056] Step 61 of the code optimization process is re-ordering slot assignments within fetch packets. For each fetch packet, the instructions are viewed by slot assignment. It is determined whether instructions within a fetch packet can be re-ordered so that changing of functional units from cycle to cycle is minimized. The effect of Step 61 is a "vertical aligning" of functional unit assignments.

[0057] FIGURES 7A and 7B illustrate an example of Step 61. FIGURE 7A shows an instruction stream 70 before the optimization of Step 61. FIGURE 7B shows almost the same instruction stream 70, optimized in accordance with Step 61.

[0058] Instruction stream 70 has three fetch packets. As illustrated, in the second fetch packet, the optimization of Step 61 moves an instruction having an ADD.L1X opfield to a slot in which there was an ADD.L1 opfield in the previous fetch packet. The opfield is the same with the addition of an "X" signifying a cross path. In the third fetch packet, Step 61 moves two instructions, one with an opfield ADD.L1X and the other with an opfield ADD.L2, to the same slots as instructions having corresponding opfields in the previous two fetch packets. Likewise, Step 61 moves the B (branch) instruction so that the LDW.D2 instruction may occupy the same slot as the LDW.D2 instructions of the previous packets. A NOP (no operation) instruction is used as a place holder so that the same slots will have the same instruction type.
[0059] Step 61 can be applied to fetch packets having more than one execute packet. In this case, the order of the execute packets must be preserved, but slot assignments can be changed within an execute packet. In general, code having a single execute packet per fetch packet, such as the code of FIGURES 7A and 7B, will be optimized to a greater extent than code having multiple execute packets per fetch packet.

[0060] The above examples are specific to processor 10, whose instructions have an opfield containing both the instruction type and the functional unit assignment. For other processors, the functional unit assignment may be in a different field. In any event, the optimization of Step 61 is directed to reordering instructions within fetch packets so as to align functional unit assignments. This alignment of functional unit assignments reduces the number of bits changing in each slot from one cycle to the next.
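One way the "vertical aligning" of Step 61 might be sketched is shown below. This is a simplified, hypothetical model rather than the tool contemplated by the description: each instruction is a (mnemonic, unit) pair, both fetch packets have the same number of slots, and each fetch packet is a single execute packet so that any reordering within it is legal.

```python
def align_units(prev, cur):
    """Reorder `cur` so each instruction sits where `prev` used the same unit.

    prev, cur: equal-length lists of (mnemonic, unit) pairs, one per slot.
    Assumes the whole fetch packet is one execute packet, so the slot order
    does not affect functionality.
    """
    result = [None] * len(cur)
    leftovers = []
    for instr in cur:
        unit = instr[1]
        # Look for a still-free slot where the previous packet used this unit.
        slot = next((i for i, p in enumerate(prev)
                     if p[1] == unit and result[i] is None), None)
        if slot is None:
            leftovers.append(instr)
        else:
            result[slot] = instr
    # Instructions with no matching unit fill the remaining slots in order.
    free = [i for i, r in enumerate(result) if r is None]
    for slot, instr in zip(free, leftovers):
        result[slot] = instr
    return result
```

For fetch packets that contain several execute packets, the same idea would have to be applied within each execute packet only, since the order of execute packets must be preserved.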

[0061] Step 63, like Step 61, aligns functional unit assignments to avoid unnecessary switching between them. However, Step 63 involves providing new functional unit assignments rather than re-ordering existing instructions.

[0062] Step 63 is based on the fact that there are certain instructions that are executable by more than one type of functional unit. For example, referring again to FIGURE 4, processor 10 has certain instructions that can be executed on both the L and S functional units, and some of these can be executed on the D units as well.
[0063] FIGURES 8A and 8B are examples of unoptimized code and optimized code, respectively, where the optimization has been performed in accordance with Step 63. As indicated, an instruction stream has three fetch packets, and each fetch packet has an ADD instruction in the same slot. The unoptimized code of FIGURE 8A is executable because the ADD instruction can be performed on any of the functional units (D, S, or L). However, switching between them is unnecessary. Thus, in FIGURE 8B, the same functional unit (L) is used for all three ADD instructions.
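The re-assignment of Step 63 can be illustrated for a single slot. The sketch below is an assumption-laden simplification: the capability table is partial (only the cases stated in the text), the names are hypothetical, and it ignores the possibility that the preferred unit is already claimed by another instruction in the same execute packet, which a real tool would have to check.

```python
# Partial, illustrative capability table (only cases stated in the text).
CAPABLE = {"ADD": {"L", "S", "D"}, "MPY": {"M"}}


def reassign_units(slot_stream):
    """Keep one slot on the same functional unit from cycle to cycle when legal.

    slot_stream: list of (mnemonic, unit) pairs for ONE slot over successive
    cycles. Conflicts with other slots in the same cycle are NOT checked.
    """
    out = [slot_stream[0]]
    for mnemonic, unit in slot_stream[1:]:
        prev_unit = out[-1][1]
        # Reuse the previous cycle's unit if this instruction type can run on it.
        if prev_unit in CAPABLE.get(mnemonic, {unit}):
            unit = prev_unit
        out.append((mnemonic, unit))
    return out
```

Applied to three ADD instructions assigned to L, S, and D, the sketch places all three on L, which mirrors the outcome described for FIGURE 8B.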
[0064] FIGURES 9A and 9B are another example of optimization in accordance with Step 63. This example illustrates optimization of fetch packets having multiple execute packets. In this case, the cycle-to-cycle analysis of functional unit assignments is directed to execute packets. However, the same concept would apply if the execute packets were fetched as fetch packets.

[0065] The optimization illustrated by FIGURES 9A and 9B is best understood by charting the cycle-by-cycle usage of the functional units. For the code of FIGURE 9A, which is the code before optimization, such a chart would be:

[0066] For the optimized code of FIGURE 9B, the chart would be:

As in the example of FIGURES 8A and 8B, functional units are re-assigned to avoid unnecessary switching between functional units from cycle to cycle. The optimization results in better alignment of the functional units.
[0067] Step 64 is directed to instructions having conditional field assignments. A characteristic of processor 10 is that the 3-bit conditional register field is all 0's for an unconditional instruction. Conditions of registers B0, B1, and A1 have only one "1" in the conditional field. On the other hand, conditions of registers B2 and A2 have two "1's". Thus, to minimize the number of bits changing from unconditional instructions to conditional instructions, registers B0, B1, and A1 are preferred.
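The preference can be checked with a short sketch. The specific 3-bit codes below are an assumption chosen to be consistent with the bit counts stated in the text (one "1" for B0, B1, and A1; two for B2 and A2; all zeros for unconditional); the text itself does not list the encodings, and the names are hypothetical.

```python
# ASSUMED 3-bit conditional register codes, consistent with the stated bit counts.
CREG = {None: 0b000, "B0": 0b001, "B1": 0b010, "B2": 0b011, "A1": 0b100, "A2": 0b101}


def toggles_from_unconditional(reg):
    """Bits that change in the conditional field when an unconditional
    instruction is followed by one conditioned on `reg`."""
    return bin(CREG[None] ^ CREG[reg]).count("1")


# Registers whose use costs only a single bit change after an unconditional instruction.
preferred = sorted(r for r in CREG if r and toggles_from_unconditional(r) == 1)
```

Under the assumed codes, `preferred` contains A1, B0, and B1, matching the registers named in the text.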

[0068] FIGURES 10A and 10B illustrate an example of Step 64. Comparing the unoptimized code of FIGURE 10A to the optimized code of FIGURE 10B, in the first cycle, Step 64 exchanges the ADDs on S2 and D2. As a result of this modification, the number of bits changing in the conditional register field and operand field is reduced. Considering only Slots 5 and 6, in the unoptimized code, the conditional and operand fields are:

This results in 15 bit changes: 8 for the L2 instruction (2+2+2+2) and 7 for the D2 instruction (2+1+2+2). In the optimized code, for Slots 5 and 6, these fields are:

This results in 13 bit changes: 5 for the L2 instruction (0+1+2+2) and 8 for the D2 instruction (2+2+2+2). This optimization reduces power usage by instruction dispatch unit 11b and instruction decode unit 11c.

[0069] Step 65 of the optimization process analyzes the operand field of the instructions. Operands are re-ordered or registers re-assigned, if this would result in a lower number of bits changing in the operand field. As described above in connection with FIGURE 3, depending on the instruction type, the operand field will identify various source registers, a destination register, or constants. It is a large field in proportion to the total bit size of the instruction. For example, for processor 10, the operand field is 15 bits of the 32-bit instructions. Thus, Step 65 can have an important effect on power optimization.

[0070] FIGURES 11A and 11B are an example of optimization in accordance with Step 65. In this example, the re-ordering of operands is within an instruction. The unoptimized code of FIGURE 11A is optimized in FIGURE 11B. Two fetch packets are shown, with each fetch packet being executed in a single execute cycle.

[0071] Considering only Slot #2 for each of the two cycles, the unoptimized code of FIGURE 11A is:

The optimized code of FIGURE 11B is:

The binary code for 11 is 1011, and the binary code for 12 is 1100. Thus, the re-ordering of the operands in slot #2 reduces the number of bits changing in the operand field by six.
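The saving of six bits can be reproduced with a small sketch. The scenario is hypothetical, since the figure listings are not reproduced here: it assumes a commutative instruction (such as ADD) whose two source registers are numbers 11 and 12 in one cycle and the same registers in the opposite order in the next. The function names are invented for the example.

```python
def hamming(a, b):
    """Number of differing bits between two register numbers."""
    return bin(a ^ b).count("1")


def best_source_order(prev_srcs, srcs):
    """Pick the order of two commutative source operands that toggles the
    fewest bits relative to the previous cycle's source fields.

    prev_srcs, srcs: (src1, src2) register numbers. Only valid when swapping
    the sources does not change the result (e.g. ADD, not SUB).
    """
    a, b = srcs
    keep = hamming(prev_srcs[0], a) + hamming(prev_srcs[1], b)
    swap = hamming(prev_srcs[0], b) + hamming(prev_srcs[1], a)
    return (b, a) if swap < keep else (a, b)
```

Because 1011 and 1100 differ in three bit positions, leaving the two source fields in opposite orders costs 3 + 3 = 6 bit changes, while matching the previous order costs none.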

[0072] FIGURES 12A and 12B are another example of Step 65, showing unoptimized code and the corresponding optimized code, respectively. Here, the reordering of operands involves a switch between two different instructions. Slots 2 and 8 of three fetch packets are shown. Comparing the fetch packets of the second cycle (FP2) of the unoptimized code of FIGURE 12A to the optimized code of FIGURE 12B, the SUB instructions on S2 and L2 have been switched. This reduces the number of bits changing in the operand fields of Slots 2 and 8.

[0073] Step 65 can also be accomplished with an overall assessment of register use. When there is a choice of registers to use in a given instruction, the register that causes the fewest bits to change from the previous or next instruction can be selected.

[0074] Step 67 is re-arranging NOP (no operation) instructions so as to provide a smoother code profile. More specifically, Step 67 determines whether there are NOPs that can be moved from one fetch packet to another without affecting the functionality of the code.

[0075] FIGURES 13A and 13B illustrate an example of unoptimized code and the corresponding optimized code, respectively, where the optimization is in accordance with Step 67. The code has eight fetch packets, FP1 ... FP8. The shaded slots contain instructions that are not NOP instructions. As illustrated in the example of FIGURE 13B, a number of NOP instructions have been moved from one fetch packet to another. Because NOP instructions are all 0's, their placement has a significant effect on the number of bits changing from cycle to cycle.

[0076] Step 68 is adding dummy instructions to reduce the number of times that a slot switches from NOP to a non-NOP instruction back to a NOP instruction. These dummy instructions duplicate most of the previous or upcoming instruction without adversely affecting data integrity.

EP 0 926 588 A2

[0077] FIGURES 14A and 14B are an example of unoptimized code and the corresponding optimized code, respectively, where the optimization is in accordance with Step 68. Only a single slot of three fetch packets is shown. FIGURE 14A is an example of unoptimized code, having a NOP instruction in Slot 2 in the second cycle. FIGURE 14B is the optimized code, where the NOP has been replaced with a dummy MPY instruction. The dummy instruction does not affect the integrity of the data because the result has been placed in a destination register, Bxx, which is an unused register in the code segment. Because the dummy instruction duplicates much of the preceding and following instructions, the internal toggle activity of processor 10 is reduced. Step 68 is most effective for loop code segments.
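The replacement of an isolated NOP can be sketched for a single slot as follows. This is a hypothetical model, not the procedure of FIGURE 14B itself: instructions are (mnemonic, src1, src2, dst) tuples, a NOP is represented by `None`, and the caller is assumed to have already verified that `unused_reg` is not read or written anywhere in the code segment and that the extra instruction causes no resource conflict.

```python
def fill_single_nops(slot_stream, unused_reg):
    """Replace a NOP sandwiched between two non-NOPs with a dummy instruction.

    slot_stream: list for ONE slot over successive cycles; each entry is None
    (a NOP) or a (mnemonic, src1, src2, dst) tuple.
    unused_reg: a register the caller knows to be unused in this segment.
    """
    out = list(slot_stream)
    for i in range(1, len(out) - 1):
        if out[i] is None and out[i - 1] is not None and out[i + 1] is not None:
            mnemonic, src1, src2, _ = out[i - 1]
            # Copy the previous instruction but redirect its result to the
            # unused register so that no live data is overwritten.
            out[i] = (mnemonic, src1, src2, unused_reg)
    return out
```

Only single NOPs between two real instructions are filled in this sketch; runs of consecutive NOPs and NOPs at the ends of the stream are left alone.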
[0078] FIGURES 15A and 15B illustrate another example of unoptimized code and the corresponding optimized code, respectively, where the optimization is in accordance with Step 68. This example is of a code segment within a loop. As in FIGURE 14A, in the unoptimized code of FIGURE 15A, in Slot 2, the instructions switch from a non-NOP to a NOP to a non-NOP. In the optimized code of FIGURE 15B, the dummy instruction is a false conditional instruction. For false conditional instructions, the transfer of the result from functional unit to destination register is always disabled. A conditional register, B0, has been reserved for use with dummy instructions. Before entering the loop, the conditional register is set to some value. In the example of FIGURES 15A and 15B, B0 is used for the dummy instruction register and is also the loop counter. Because B0 is nonzero until the final pass of the loop, for all but the final pass, the result of the conditional instruction is not written to A12. On the final pass, the result is written to A12. However, because A12 is not written to in the preceding instruction and is not used as a source in the following instruction, data integrity is not affected. In cycle 3, the instruction writes to A12, which was the original function of the code.
[0079] Typically, the optimal dummy instruction for Step 68 will be a dummy instruction using a false conditional,
such as in the example of FIGURES 15A and 15B. However, in some cases, such as when a conditional register is not
available, an alternative dummy instruction, such as that of FIGURES 14A and 14B, may be used. As a result of Step
68, fewer bits change state in the incoming instruction stream from program memory 12. Also, fewer nodes change
in decode unit 11c.
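
The effect of Step 68 on a single slot can be illustrated with a short sketch. It counts the bits that change in one slot
over three consecutive cycles, first with a NOP in the middle cycle and then with a dummy instruction in its place. The
32-bit values below are hypothetical encodings chosen only for illustration; they are not actual opcodes of processor 10,
and the function names are likewise assumptions.

```python
# Hypothetical 32-bit encodings for one slot; NOT real opcodes of any processor.
NOP = 0x00000000
MPY_BEFORE = 0x02A41C80   # instruction in the preceding cycle
MPY_AFTER = 0x03281C80    # instruction in the following cycle
DUMMY_MPY = 0x0F241C80    # same opcode bits, destination field set to an unused register


def bit_changes(a: int, b: int) -> int:
    """Return the number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def slot_toggles(words) -> int:
    """Sum the bit changes in one slot over consecutive cycles."""
    return sum(bit_changes(p, q) for p, q in zip(words, words[1:]))


unoptimized = [MPY_BEFORE, NOP, MPY_AFTER]       # non-NOP, NOP, non-NOP
optimized = [MPY_BEFORE, DUMMY_MPY, MPY_AFTER]   # NOP replaced by a dummy

print(slot_toggles(unoptimized))  # 16 bit changes with these example values
print(slot_toggles(optimized))    # 8 bit changes with these example values
```

With these assumed encodings the dummy instruction halves the number of bit changes in the slot. The actual saving
depends on the real instruction encodings and on how closely the dummy matches its neighbors.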

[0080] Step 69 of the optimization process is to analyze address locations of fetch packets in program memory 12.
For sections of code that are executed repeatedly, such as in loops, the number of bits changing on program memory
address lines can be minimized.

[0081] As a simplified example of Step 69, assume that a first fetch packet of a loop has address ...0111 and the
next has the address ...1000 in program memory 12. Each time the program memory 12 switches from accessing the
first packet to accessing the second packet, four address bits change. If the second packet were moved to address
...0110, then only one bit would change.
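
The arithmetic of this example can be checked with a minimal sketch. Only the four low-order address bits given above
are modelled, on the assumption that the elided upper address bits are identical for both packets; the function name is
hypothetical.

```python
def address_bit_changes(addr_a: int, addr_b: int) -> int:
    """Return the number of address lines that toggle between two addresses."""
    return bin(addr_a ^ addr_b).count("1")


first_packet = 0b0111       # low four bits of the first fetch packet address
second_original = 0b1000    # original location of the second packet
second_moved = 0b0110       # second packet after relocation

print(address_bit_changes(first_packet, second_original))  # 4
print(address_bit_changes(first_packet, second_moved))     # 1
```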

Automation of the Optimization Process 

[0082] Each of the above-described optimization techniques could be performed manually by an assembly code
programmer. However, in more sophisticated embodiments of the invention, one or more of the techniques are
performed automatically, with a code generation tool. Such a tool would be programmed to detect code sequences in
which a particular technique is applicable and to perform the optimization called for by that technique.
[0083] Some of the above-described steps are accomplished without affecting the functionality of the code from one
cycle to the next. These steps include Steps 61, 63, 64, 65, and 69.
[0084] Other of the above-described steps are capable of affecting code functionality. These steps include Steps 67
and 68. For these optimization techniques, the automated optimization process could include heuristic rules to resolve
functionality issues. Alternatively, the optimization process could output a message to the programmer, indicating that
an optimization might be possible at the programmer's option.
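
As one possible illustration of the message-based alternative, the sketch below scans a sequence of fetch packets for
slots that go from a non-NOP to a NOP and back to a non-NOP, which is the pattern addressed by Step 68, and reports
each occurrence to the programmer instead of changing the code. The data layout (each fetch packet as a list of
mnemonic strings, one per slot, all packets of equal length), the class name, and the function name are assumptions
made for this example and are not taken from any actual code generation tool.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    cycle: int    # index of the fetch packet containing the NOP
    slot: int     # slot position within that fetch packet
    message: str  # text shown to the programmer


def find_nop_sandwiches(packets, nop="NOP"):
    """Report slots holding a NOP between two non-NOP instructions.

    `packets` is a list of fetch packets; each packet is a list of
    mnemonic strings, one per slot (an assumed, simplified layout).
    """
    suggestions = []
    for cycle in range(1, len(packets) - 1):
        for slot, instr in enumerate(packets[cycle]):
            before = packets[cycle - 1][slot]
            after = packets[cycle + 1][slot]
            if instr == nop and before != nop and after != nop:
                suggestions.append(Suggestion(
                    cycle, slot,
                    f"cycle {cycle}, slot {slot}: NOP between {before} and "
                    f"{after}; a dummy instruction might reduce bit changes",
                ))
    return suggestions


example = [
    ["ADD", "MPY"],
    ["ADD", "NOP"],
    ["SUB", "MPY"],
]
for s in find_nop_sandwiches(example):
    print(s.message)
```

The sketch only detects candidates. Whether a dummy instruction can be inserted without affecting data integrity, for
example by using an unused destination register or a false conditional, is left to heuristic rules or to the programmer.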

Other Embodiments

[0085] Although an embodiment of the present invention has been described in detail, it should be understood that 
various changes, substitutions, and alterations can be made hereto without departing from the spirit and scope of the 
invention as defined by the appended claims. 

[0086] The scope of the present disclosure includes any novel feature or combination of features disclosed therein
either explicitly or implicitly or any generalisation thereof irrespective of whether or not it relates to the claimed invention
or mitigates any or all of the problems addressed by the present invention. The applicant hereby gives notice that new
claims may be formulated to such features during the prosecution of this application or of any such further application
derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be
combined with those of the independent claims and features from respective independent claims may be combined in any
appropriate manner and not merely in the specific combinations enumerated in the claims.

[0087] Further and particular embodiments of the invention are enumerated in the following numbered statements. 

1. A method of reducing power usage by a processor that processes multiple-instruction words, such that instructions
in each of said words are executed by different functional units of said processor, during one or more
processor cycles, comprising the steps of:

comparing, for the first instruction of each of a number of instruction words, functional unit assignments;

determining whether, from cycle to cycle, the number of bit changes in the binary representation of said
functional unit assignments can be reduced;

modifying at least one of said first instructions in accordance with said determining step; and

repeating said comparing, determining, and modifying steps for each next instruction of said number of
instruction words.

2. The method of statement 1, wherein said modifying step is performed by re-ordering instructions within said 
instruction words. 

3. The method of statement 1 or 2, wherein said modifying step is performed by replacing a functional unit assign- 
ment with another functional unit assignment. 

4. A method of reducing power usage by a processor that processes multiple-instruction words, such that instructions
in each of said words are executed by different functional units of said processor, during one or more
processor cycles, comprising the steps of:

comparing, for the first instruction of each of a number of instruction words, operand fields;

determining whether, from cycle to cycle, the number of bit changes in the binary representation of any of said
operand fields can be reduced;

modifying at least one of said first instructions in accordance with said determining step; and

repeating said comparing, determining, and modifying steps for each next instruction of said number of
instruction words.

5. The method of statement 4, wherein said comparing, determining, and modifying steps are directed to operands
within each said instruction, and wherein said modifying step is performed by re-ordering operands.

6. The method of statement 4 or 5, wherein said comparing, determining, and modifying steps are directed to 
operands within each said instruction, and wherein said modifying step is performed by re-assigning operand 
locations. 

7. A method of reducing power usage by a processor that processes multiple-instruction words, such that instructions
in each of said words are executed by different functional units of said processor, during one or more
processor cycles, comprising the steps of:

comparing the first instruction of each of a number of instruction words, thereby detecting no-operation
instructions;

determining whether, from cycle to cycle, the number of bit changes in the binary representations of any of
said first instructions can be reduced;

modifying at least one of said first instructions in accordance with said determining step; and

repeating said comparing, determining, and modifying steps for each next instruction of said number of instruction
words.

8. The method of statement 7, wherein said modifying step is performed by moving said no-operation instruction from
one of said instruction words to another.

9. The method of statement 7 or 8, wherein said modifying step is performed by replacing said no-operation instruction
with a dummy instruction.

10. A method of reducing power usage by a processor that processes multiple-instruction words, such that instructions
in each of said words are executed by different functional units of said processor, during one or more
processor cycles, comprising the steps of:

scanning said multiple-instruction words to locate one or more loops of said multiple-instruction words;
comparing the program memory addresses of said words within said loops;

determining whether, from cycle to cycle, the number of bit changes in the binary representations of any of
said program memory addresses can be reduced; and

modifying at least one of said addresses in accordance with said determining step.

11. A method for optimizing a computer program for minimum power consumption by a computer executing said
program, comprising the steps of:

finding cycle-to-cycle bit changes in a binary representation of said program in assembly language code,
minimizing cycle-to-cycle bit changes in said binary code by at least one of the following:
aligning functional unit assignments to reduce the number of bits changing each time slot of an instruction
word from one cycle to the next, or

for instructions executable by more than one functional unit assigning functional units to avoid unnecessary
bit switching from cycle to cycle, or

minimizing the number of bits changing caused by changing from unconditional to conditional instructions or
vice versa, or

reordering operand and/or register assignments to reduce the number of bits changing in operand fields, or
moving non-NOPs between fetch packets without affecting code functionality, or

adding dummy instructions to reduce the number of times an instruction word slot switches from NOP to non-NOP
to NOP without affecting data integrity, or

modifying address sequences to minimize the number of address bits that change between execution packets.



Claims 

1. A method for reducing power usage by a processor that processes multiple-instruction words, such that instructions
in each of said words are executed by different functional units of said processor, during one or more processor
cycles, comprising the steps of:

comparing the syntax of a number of said instruction words;

determining whether, from cycle to cycle, the number of bit changes in the binary representations of any of
said instruction words can be reduced by changing bits without substantially affecting functionality of said
instruction words; and

modifying at least one of said instruction words in accordance with said determining step.

2. The method of Claim 1, wherein said comparing, determining, and modifying steps are directed to a functional unit
identifier within each said instruction, and wherein said modifying step is performed by reordering instructions
within said instruction words.

3. The method of Claim 1 or Claim 2, wherein said comparing, determining, and modifying steps are directed to a
functional unit assignment within each said instruction, and wherein said modifying step is performed by replacing
said functional unit assignment with another functional unit assignment.

4. The method of any of Claims 1 to 3, wherein said comparing, determining, and modifying steps are directed to a 
conditional register assignment within each said instruction, and wherein said modifying step is performed by re- 
assigning a conditional register. 

5. The method of any of claims 1 to 4, wherein said comparing, determining, and modifying steps are directed to 
operands within each said instruction, and wherein said modifying step is performed by re-ordering operands. 

6. The method of any of claims 1 to 5, wherein said comparing, determining, and modifying steps are directed to 
operands within each said instruction, and wherein said modifying step is performed by re-assigning operand 
locations. 

7. The method of any of claims 1 to 6, wherein said comparing, determining, and modifying steps are directed to
no-operation instructions, and wherein said modifying step is performed by moving said no-operation instruction from
one of said instruction words to another.

8. The method of any of claims 1 to 7, wherein said comparing, determining, and modifying steps are directed to
no-operation instructions, and wherein said modifying step is performed by replacing said no-operation instructions
with dummy instructions.

9. The method of any of claims 1 to 6, wherein said processor is a very long instruction word processor. 

10. The method of any of claims 1 to 9, wherein said processor is a dual datapath processor.

11. The method of any of claims 1 to 10, wherein said multiple-instruction words are fetch packets, such that all
instructions in each of said instruction words are fetched from a memory at substantially the same time.



12. The method of any of Claims 1 to 11, further including:

scanning said multiple-instruction words to locate one or more loops of said multiple-instruction words; and
wherein said comparing, determining, and modifying steps are directed to reducing the number of bit changes
in the binary representations of any of said program memory addresses.